Hello and welcome to the course!
We are going to cover a lot of material on analysis of transcripts of individual cells in two formats, single-cell/nucleus and spatial transcriptomics. The concepts are hopefully generalizable to more types of data, depending on their format. In particular the Giotto object for spatial transcriptomics is already constructed for multiple types of spatial molecules beyond RNA.
This course assumes you have a working knowledge of R, but will
remove the more complex aspects of deploying these analyses at scale
(multithreading, gpus, batch scheduling, etc). For the most part we will
rely on two R concepts to work efficiently: piping steps together with
%>% and running functions on lists using
lapply. Additionally, it makes use of the
data.table object which is a much faster implementation of
a data.frame with built in data processing features.
Also note that many of the concepts here have analgous implementations in python in formats such as Scanpy and Squidpy.
First, lets load up some packages to get us going…
To begin, this course assumes you have a basic familiarity with single-cell analysis using the Seurat package. We won’t cover each step in huge detail, but leave it for you to explore. We can summarise the key steps here:
Today we are going to be using sample data from SEA-AD, generated by the Allen Institute. The cohort is deeply phenotyped, with clinical data, imaging data, single-nucleus sequencing, and MERSCOPE spatial transcriptomics per individual, giving us a lot of ing
This data is publicly available, and a really useful resource:
| AWS host |
|---|
For our purposes having matched snRNAseq and spatial transcriptomic data allows us to compare the the two data types in our analysis.
This will define standard settings you intend to run for the analysis
sc <- readRDS(file.path(data_path, "SEAAD_MTG_RNAseq_final-nuclei.2k_demo.rds"))
# do you have multiple conditions to integrate?
integ <- TRUE
# what column name contains the groups you are integrating?
g_col <- 'Group'
# how many principal components do you wish to use?
pcs <- 30
# what clustering resolution?
resi <- 0.8
The bulk of the processing can be done in rapid succession, with
linking %>% chains (in R >4.4, they are now using
|> pipes instead, but we’ll leave it for backward
compatibility)
if (integ) {
# split by group column
sc[["RNA"]] <- split(sc[["RNA"]], f = sc@meta.data[[g_col]])
# Prep & PCA on each group
sc %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData(vars.to.regress = c("Donor.ID", "Sex", "Age.at.Death")) %>%
RunPCA(npcs = pcs) %>%
RunUMAP(dims = 1:pcs, reduction.name = "umap") %>%
# feature selection and integration prep
IntegrateLayers(
method = HarmonyIntegration,
new.reduction = "harmony",
verbose = TRUE
) %>%
RunUMAP(
dims = 1:pcs,
reduction = "harmony",
reduction.name = "umap.harmony"
) %>%
FindNeighbors(reduction = "harmony", dims = 1:pcs) %>%
FindClusters(resolution = resi) %>%
# combine again
JoinLayers() -> sc
} else { # otherwise, ignore integration
sc %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData(vars.to.regress = c("Donor.ID")) %>%
RunPCA(npcs = pcs) %>%
FindNeighbors(dims = 1:pcs) %>%
FindClusters() %>%
RunUMAP(dims = 1:pcs) -> sc
}
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 24000
## Number of edges: 910599
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9407
## Number of communities: 33
## Elapsed time: 3 seconds
# # Save progress
# saveRDS(sc, file = "SEAAD_MTG_RNAseq_final-nuclei.2k_demo.rds")
# sc <- readRDS(
# file.path(data_path, "SEAAD_MTG_RNAseq_final-nuclei.2k_demo.rds")
# )
gc(verbose = FALSE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5061559 270.4 9334738 498.6 9334738 498.6
## Vcells 480111423 3663.0 1374467962 10486.4 1689263053 12888.1
How did that integration turn out?
We can visualize with the two UMAPs if we did integration, or just plotting the standard UMAP if we did not.
Why did we split and integrate our experimental groups?
if (integ) {
p1 <- DimPlot(sc, reduction = "umap", group.by = "Donor.ID", shuffle = TRUE)
p2 <- DimPlot(sc, reduction = "umap.harmony", group.by = "Donor.ID", shuffle = TRUE)
p1 + p2 + plot_layout(guides = 'collect')
} else {
DimPlot(sc, reduction = "umap", group.by = "Donor.ID", shuffle = TRUE)
}
Generally, not a lot of improvement, the sequencing pipeline the Allen Inst. used seemed to be pretty consistent across donors.
Cell type identification is a class unto itself. Usually good methods involve a mix of automated cell type predictions, followed by careful curation to:
One useful tool for lightweight annotation prediction is the MapMyCells
web server. In our case, we already have expertly curated cell types,
defined by the Subclass column, as well as our experimental
phenotypes in the seurat object metadata, for every cell at
sc@meta.data.
What if I just wanted the the individual donor level phenotype data ?
# Selecting your umap to plot
if (integ) {
red <- "umap.harmony"
} else {
red <- "umap"
}
# plots by functional annotation
p1 <- DimPlot(sc, reduction = red, group.by = "Subclass", label = TRUE, repel = TRUE) +
ggtitle("clusters") +
NoLegend() +
theme(plot.title = element_text(hjust = 0.5))
p2 <- DimPlot(sc, reduction = red, group.by = g_col, shuffle = TRUE) +
theme(legend.position = 'inside', legend.position.inside = c(0.6, 0.95))
p1 + p2
Now that we have data, we can use Seurat’s built in tools to examine AD vs control differences. Notably, there are multiple methods for testing, with various options. Tools like MAST have some more stringent tests that also allow for inclusion of latent variables to correct for. Confounding variables like “Donor.ID” or “Sex” can impact your results!
For now, we are going for speed, running the standard Wilcoxon Rank Sum test (sped up with presto) using an lapply call to run each cell type in a list. We can then use data.table to stack them up and do fast operations on them like count the totals of DE genes (p_val_adjust < 0.05):
# which column contains cell types?
ctype <- 'Subclass'
# which column contains experimental groups?
group <- g_col
# standard DE analysis --------------------------------------------------------
Idents(sc) <- ctype
# running in a lapply loop to test each cell type (future.lapply can be even faster)
de <- lapply(levels(sc), function(i) {
data.table(
FindMarkers(sc,
ident.1 = 'ad',
ident.2 = 'ctrl',
group.by = group,
subset.ident = i,
# test.use = "MAST", # more stringent test, also allows the use of latent.vars
# latent.vars = c("Donor.ID", "Sex", "Age.at.Death"),
max.cells.per.ident = 1000
),
keep.rownames = TRUE
)
})
## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
## ℹ Please use `rlang::check_installed()` instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Using data.table allows us to quickly/safely combine our de results into one table
setattr(de, 'names', levels(sc))
de <- rbindlist(de, idcol = ctype)
setnames(de, "rn", "Symbol")
# How many DE genes do we have?
de[p_val_adj < 0.05, .N, by = Subclass]
# # Let's save it
# fwrite(de, file = "SEAAD_MTG_RNAseq_final-nuclei.2k_demo.ad_vs_ctrl.txt.gz")
de <- fread(
file.path(data_path, "SEAAD_MTG_RNAseq_final-nuclei.2k_demo.ad_vs_ctrl.txt.gz")
)
However, when doing differential expression it is important to
consider that testing thousands of cells can inflate estimates of
significance. Do you have a highly significant change with an
avg_log2FC of 0.10 in less than 10% of your cell type
(pct.1 and pct.2). That may not reflect
biologically relevant variability. In general you can limit this with
certain options like limiting max.cells.per.ident for your
comparisons.
The most robust method is to construct pseudo-bulk datasets for a bulkRNAseq style measure with DESeq2. However, this can be an issue if you have few samples per group. In our case even with n == 6 per group there are only a few DE genes in any cells…
# # using pseudobulk analysis to control for within-sample correlation -----------
# pseudo <- AggregateExpression(sc,
# assays = "RNA",
# return.seurat = TRUE,
# group.by = c("Donor.ID", group, ctype)
# )
# pseudo@assays$RNA$counts <- round(pseudo@assays$RNA$counts)
#
# # add in some meta.data from the overall plot
# meta <- unique(data.table(sc@meta.data[, c("Donor.ID", "Sex", "Age.at.Death")]))
# # map
# Idents(pseudo) <- "Donor.ID"
# pseudo[["Sex"]] <- plyr::mapvalues(
# x = pseudo@active.ident,
# from = meta[["Donor.ID"]],
# to = meta[["Sex"]]
# )
# pseudo[["Age.at.Death"]] <- plyr::mapvalues(
# x = pseudo@active.ident,
# from = meta[["Donor.ID"]],
# to = meta[["Age.at.Death"]]
# )
#
# Idents(pseudo) <- ctype
# # running in a lapply loop to test each cell type (future.lapply can be even faster)
# de <- lapply(levels(pseudo), function(i) {
# data.table(
# FindMarkers(pseudo,
# ident.1 = 'ad',
# ident.2 = 'ctrl',
# group.by = group,
# subset.ident = i,
# latent.vars = c("Sex", "Age.at.Death"),
# test.use = "DESeq2"
# ),
# keep.rownames = TRUE
# )
# })
# # Using data.table allows us to quickly/safely combine our de results into one table
# setattr(de, 'names', levels(pseudo))
# de <- rbindlist(de, idcol = ctype)
# setnames(de, "rn", "Symbol")
#
# # How many DE genes do we have?
# de[p_val_adj < 0.05, .N, by = Subclass]
In the end, lets see what we got with some pictures
# Add a significant column with a nested ifelse
de[, sig := ifelse(
test = avg_log2FC < -0.5 & p_val_adj < 0.05,
yes = "down",
no = ifelse(
test = avg_log2FC > 0.5 & p_val_adj < 0.05,
yes = "up",
no = "ns"
)
)]
# Add color, size and alpha (transparency) to volcano plot --------------------
cols <- c("up" = "#ffad73", "down" = "#26b3ff", "ns" = "grey")
sizes <- c("up" = 1, "down" = 1, "ns" = 0.5)
alphas <- c("up" = 1, "down" = 1, "ns" = 0.5)
# we need to plot each cell type, so we again use an lapply function for each Subclass
plist <- lapply(sort(unique(de$Subclass)), function(i) {
ggplot(de[Subclass == i],
aes(x = avg_log2FC, y = -log10(p_val_adj), fill = sig, size = sig, alpha = sig)
) +
geom_point(shape = 21, color = "black") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-0.5, 0.5), linetype = "dashed") +
scale_fill_manual(values = cols) + # Modify point color
scale_size_manual(values = sizes) + # Modify point size
scale_alpha_manual(values = alphas) + # Modify point transparency
scale_x_continuous(
breaks = c(seq(-8, 8, 2)),
limits = c(-8, 8)
) +
ggtitle(i) + theme_classic() +
theme(
text = element_text(size = 5),
plot.title = element_text(hjust = 0.5),
legend.position = 'none'
)
})
# stick them all together
wrap_plots(plist) + plot_layout(guides = 'collect')
So, spatial transcritomics is any of several technologies that can put spatial coordinates on RNA data. RIght now there are at least a dozen different platforms, and that is just in the RNA space.
Previously Dr. McClatchy taught a very good workshop on the first generation of spatial data, using spot based sequencing. read the tutorial
What are some approaches to capturing spatial data?
The 30 second breakdown for major classes of spatial transcriptomics:
Spot-based sequencing across a slide, followed by sequencing
Probe-based imaging using serial washes of probes and imaging
Each brings it’s own caveats and costs, but for the present, the main limiting factor is number of genes. Spot-based systems can detect every gene you can sequence, grouped into spots/pixels/hexagons. Probe based systems you need a probe for each gene, but you get the physical location of the actual transcript, to the subcellular level. Currently a few companies are making advances in probe stability to allow thousands of probes, but most current efforts get a few hundred genes up to 1k.
What would you do with spatial information?
Spatial data of either type can be evaluated in four conceptual frameworks (so far):
This is were the different technologies show their big differences, as probe based allow for higher resolution information.
| Info | Spot-based | Probe-based |
|---|---|---|
| Spatial region detection | Yes | Yes |
| Spatial features diff expression | Yes | Yes |
| Cell-cell proximity | No | Yes |
| Interaction changed features | No | Yes |
| Ligand-receptor interaction | No | Yes |
More types of analysis are still developing, and the space can be described as…active. In particular one of the more challenging issues to address is how to analyze multiple samples and differential analysis across experimental groups. Most current analysis covers differences across regions within a single tissue sample. We will explore one particular methodology and talk briefly about the potential for novel approaches.
Part of the reason we chose the SEA-AD dataset as a reference, is because we can work on the same Donor samples across multiple technologies. The Allen Inst. included MERFISH data for the same region, MTG, for many participants, including our twelve folks. That way, we can do some compare/contrasting of the two datasets.
As with before, there are some global settings to include, and lets load things into memory.
# lets save some space in memory
remove(sc, pseudo)
## Warning in remove(sc, pseudo): object 'pseudo' not found
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5445389 290.9 9334738 498.6 9334738 498.6
## Vcells 488836531 3729.6 1374467962 10486.4 1689263053 12888.1
# define some global parameters (should still be similar to above)
# do you have multiple conditions to integrate?
integ <- TRUE
# what column name contains the groups you are integrating?
g_col <- 'Donor ID'
# how many principal components do you wish to use?
pcs <- 8
# what clustering resolution?
resi <- 0.4
# lets load in our giotto datasets
set_giotto_python_path()
## checking default envname 'giotto_env'
## a system default python environment was found
## Using python path:
## "/home/rrbutler/miniconda3/envs/giotto_env/bin/python"
## [1] "/home/rrbutler/miniconda3/envs/giotto_env/bin/python"
gobj <- loadGiotto(
file.path(data_path, "SEAAD_MTG_MERFISH.2024-12-11.giotto_obj")
)
## 1. read Giotto object
## 2. read Giotto feature information
## 3. read Giotto spatial information
## 4. read Giotto image information
## python already initialized in this session
## active environment : 'giotto_env'
## python version : 3.10
## checking default envname 'giotto_env'
## a system default python environment was found
## Using python path:
## "/home/rrbutler/miniconda3/envs/giotto_env/bin/python"
Giotto it is based on the terra spatial data package, and makes for objects that can store many different types of information:
So we can examine the Giotto object in detail to figure out the format.
What types of data does Giotto support?
# what am i?
gobj
## An object of class giotto
## >Active spat_unit: cell
## >Active feat_type: rna
## dimensions : 180, 38897 (features, cells)
## [SUBCELLULAR INFO]
## [AGGREGATE INFO]
## expression -----------------------
## [cell][rna] raw normalized scaled
## spatial locations ----------------
## [cell] raw
## dim reduction --------------------
## [cell][rna] pca harmony umap_harmony
## nearest neighbor networks --------
## [cell][rna] sNN.pca
##
##
## Use objHistory() to see steps and params used
# Giotto objects have instructions that save default settings
instructions(gobj)
## <giottoInstructions>
## python_path : /home/rrbutler/miniconda3/envs/giotto_env/bin/python
## show_plot : FALSE
## return_plot : TRUE
## save_plot : FALSE
## save_dir : /mnt/c/Users/rrbutler/OneDrive - Stanford/2025_Jax_workshop
## plot_format : png
## dpi : 300
## units : in
## height : 9
## width : 9
## is_docker : FALSE
## active_spat_unit : cell
## active_feat_type : rna
# Lets update those
instructions(gobj,
param = c("save_dir", "save_plot", "return_plot")
) <- list(getwd(), FALSE, TRUE)
We can see that “features” is not limited to RNA, but can be
multimodal. There are also spatial unit, which is currently
cell, but is similarly not contstrained. These are polygons
that can be drawn interactively (different workshop) and labeled.
What other types of polygons might fit in the spat_unit slot?
# finally, we are going to pull out some sample metadata for later
meta <- pDataDT(gobj)
meta <- unique(meta[, .(`Donor ID`, Group, Sex, `Age at Death`, `Continuous Pseudo-progression Score`)])
setnames(meta, "Continuous Pseudo-progression Score", "Pseudo_prog")
meta$`Donor ID` <- factor(meta$`Donor ID`, levels = meta$`Donor ID`)
meta[`Donor ID` == "H21.33.015", `Age at Death` := "98"]
meta$`Age at Death` <- as.integer(as.character(meta$`Age at Death`))
meta
Probe based spatial technologies allow exact RNA positioning, and
immunostained images to be underlayed behind your cell
spat_unit labels, for very cool images at real sub-cellular
resolution!
g <- GiottoData::loadGiottoMini("cosmx", verbose = FALSE)
spatInSituPlotPoints(g,
show_image = TRUE,
image_name = c("fov002-composite", "fov003-composite"),
feats = list(fDataDT(g)[, feat_ID]),
point_size = 0.25,
show_legend = FALSE,
polygon_color = 'white'
)
## plot image layer done
## --| Plotting 18749 feature points
## Warning in getDistinctColors(n = n):
## not enough unique colors in R, maximum = 444
## plot feature points layer done
## plot polygon layer done
Sadly, our MERFISH data will limit RNA counts to the cell level, where each point is a cell with ~ 180 gene expression values.
Rather than use all 12 donors for this example, we have the three
matched ad and ctrl samples (as best as we
can) with strong differential pseudo-progression scores and
Overall Neuropathological Change. To get these from the
overall dataset takes some work, but for now lets examine how they look
in space.
Do you see any notable spatial features?
p <- spatPlot(gobj,
cell_color = "Subclass",
point_size = 1,
cell_color_code = as.vector(palette.colors(n = 24, palette = "Alphabet"))
)
p + labs(subtitle = paste(meta$`Donor ID`, collapse = " ")) +
theme(legend.position = 'bottom', plot.subtitle = element_text(hjust = 0.5))
Each of these blocks represents a section through the cortex, from L1 –> white matter at the bottom. Layers of neurons tend to be easily recognized transcriptomically, and this has been the case across multiple datasets (Yao 2021, Yao 2024, Lots more can be included). This actually extends to brain regions as well, as the Allen Institute’s recent series of papers (Yao 2024, Merscope paper 2024). This can make it sort of easy to compare Layers between single-cell RNAseq and spatial RNAseq, though there is no replacement for the actual spatial information.
Bonus: How would I identify the
Donor IDif I was trying to look at it?
Thinking back to the single-cell data, there were a series of steps:
Do those still apply for spatial data?
For the most part, excepting that we are integrating using
g_col, which if you notice has changed to
"Donor ID". We’ll cover why in a bit, but here’s a big
clue. We also can actually use spatial information in our
Neighbors/Clusters step to make our clustering more constrained to
spatial zones.
You may also notice that we are no longer regressing out donor level
stats like Age at Death since samples are being integrated
at the donor level for this step. Downstream analyses will examine each
slice separately, and include these covariates in differential
meta-analysis. Should you want to try and merge samples for a more
complex analysis, you can adjust for additional covariates/batches using
adjustGiottoMatrix.
This data has already been cleaned/filtered a fair bit by the SEA-AD group, but for informational purposes, lets look first at what a giotto object is and then filter/normalize
gobj %>%
# filter out cells with too few features, or features in too few cells
filterGiotto(feat_det_in_min_cells = 5, min_det_feats_per_cell = 5) %>%
# normalize using scale factor (depends on the technology)
normalizeGiotto(scalefactor = 10000, verbose = TRUE) %>%
# calculate feature and cell statistics on normalized data
addStatistics(expression_values = "normalized") %>%
# with MERSCOPE there are very few genes, so we specify highly variable features to use
calculateHVF(method = "var_p_resid", var_number = 120, show_plot = TRUE) -> gobj
## completed 1: preparation
## completed 2: subset expression data
## completed 3: subset spatial locations
## completed 4: subset cell metadata
## completed 5: subset feature metadata
## completed 6: subset spatial network(s)
## completed 7: subsetted dimension reductions
## completed 8: subsetted nearest network(s)
## completed 9: subsetted spatial enrichment results
##
## Feature type: rna
## Number of cells removed: 0 out of 38897
## Number of feats removed: 0 out of 180
## first scale feats and then cells
## > normalized already exists and will be replaced with new values
## Setting expression [cell][rna] normalized
## > scaled already exists and will be replaced with new values
## Setting expression [cell][rna] scaled
## calculating statistics for "normalized" expression
## feat statistics has already been applied once; overwriting
## cells statistics has already been applied once; overwriting
## hvf has already been used, will be overwritten
##
## These column names were already used: var
## and will be overwritten
Now that our normalized data is ready, lets make some clusters in the
spatial style. A notable difference with other platforms like
CosMx/Xenium/scRNAseq is that there are few total genes. So automated
selection of variable genes for principal components doesn’t work. We
can specify a theshold using calculateHVF. We also need to
integrate our all of our sample slides using
Harmony.
Why integrate all samples instead of just some ?
gobj %>%
runPCA() %>%
# screePlot() %>% # with this few genes, 8 pcs is sufficient
runGiottoHarmony(
vars_use = g_col,
dimensions_to_use = 1:pcs,
) %>%
runUMAP(
dim_reduction_name = "harmony",
dim_reduction_to_use = "harmony",
name = "umap_harmony",
dimensions_to_use = 1:pcs) -> gobj
## using 'Harmony' to integrate different datasets. If used in
## published research, please cite:
## Korsunsky, I., Millard, N., Fan, J. et al.
## Fast, sensitive and accurate integration of single-cell data with Harmony.
## Nat Methods 16, 1289-1296 (2019).
## https://doi.org/10.1038/s41592-019-0619-0
## "hvf" column was found in the feats metadata information and will be
## used to select highly variable features
## Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
## TRUE, : You're computing too large a percentage of total singular values, use a
## standard svd instead.
## > pca already exists and will be replaced with
## new dimension reduction object
## Setting dimension reduction [cell][rna] pca
## Transposing data matrix
## Initializing state using k-means centroids initialization
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony converged after 4 iterations
## harmony has already been used with harmony, will be overwritten
## > umap_harmony already exists and will be replaced with
## new dimension reduction object
## Setting dimension reduction [cell][rna] umap_harmony
When clustering with spatial data, we can actually incorporate spatial information in those clusters. Leiden clustering makes use of a spatial nearest neighbor network when clustering cells. The workshop from the Giotto group actually shows what happens if you don’t integrate your datasets with two slides, the major groupings are actually the slides themselves!
gobj %>%
# Create a sNN network (default)
createNearestNetwork(dimensions_to_use = 1:pcs, k = 15) %>%
# Perform clustering
doLeidenCluster(resolution = resi, name = paste0("leiden_", resi)) -> gobj
## > 'sNN.pca' already exists and will be replaced with
## new nearest neighbor network
## leiden_0.4 has already been used, will be overwritten
# Plot the clusters upon the UMAP
p1 <- plotUMAP(gobj,
cell_color = paste0("leiden_", resi),
title = paste0("leiden_", resi),
point_size = 1,
dim_reduction_name = "umap_harmony",
show_NN_network = TRUE,
show_legend = FALSE,
cell_color_code = as.vector(palette.colors(n = 24, palette = "Alphabet"))
)
# Plot the Subclasses upon the UMAP
p2 <- plotUMAP(gobj,
cell_color = "Subclass",
title = "Subclass",
point_size = 1,
dim_reduction_name = "umap_harmony",
show_NN_network = TRUE,
show_legend = FALSE,
cell_color_code = as.vector(palette.colors(n = 24, palette = "Alphabet"))
)
# plot together
p1 + p2
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Additionally, Leiden clustering in Giotto has the ability to
doLeidenSubCluster on individual clusters to more finely
split them up without changing the overall cluster resolution. This can
allow you to subdivide more heterogeneous populations dynamically as
needed. Looking at the SEA-AD annotations, several groups (like
GABAergic neuron subtypes Pvalb, Sst, etc) could be further split into
subtypes.
But for now, we can also see how the automated clustering did across the different slides:
p <- spatPlot(gobj,
cell_color = paste0("leiden_", resi),
point_size = 1,
cell_color_code = as.vector(palette.colors(n = 24, palette = "Alphabet"))
)
p + labs(subtitle = paste(meta$`Donor ID`, collapse = " ")) +
theme(legend.position = 'bottom', plot.subtitle = element_text(hjust = 0.5))
Notably the purple, beige and
red colored clusters that are L2/3 IT neurons are in the
correct layer. The hot pink set that is L4 IT neurons are a
little harder to see in the picture, but are in the middle.
From this point on, we can also save the Giotto objects, as we won’t be modifying it anymore, just analyzing it.
# saveGiotto(gobj,
# foldername = "SEAAD_MTG_MERFISH.2024-12-11.giotto_obj",
# method = 'qs',
# overwrite = FALSE
# )
The dataset already has well characterized cell types included, but let us take a look at the standard workflow for marker genes for identifying clusters.
# find gini markers (can also use scran or Mast)
markers_gini <- findMarkers_one_vs_all(gobj,
method = "gini",
cluster_column = paste0("leiden_", resi),
min_feats = 10
)
## start with cluster 1start with cluster 2start with cluster 3start with cluster 4start with cluster 5start with cluster 6start with cluster 7start with cluster 8start with cluster 9start with cluster 10start with cluster 11start with cluster 12start with cluster 13start with cluster 14start with cluster 15start with cluster 16start with cluster 17start with cluster 18
# select genes for cluster heatmap
top_gini <- markers_gini[!feats %like% "Blank", head(.SD, 6), by = "cluster"]
p <- plotMetaDataHeatmap(gobj,
selected_feats = unique(top_gini$feats),
custom_feat_order = unique(top_gini$feats),
custom_cluster_order = unique(top_gini$cluster),
metadata_cols = paste0("leiden_", resi),
x_text_size = 8,
y_text_size = 4
)
p
Up until now, we have been able to look at the combined set of all slides. However, in order to understand spatial relationships it is more useful to examine each slide individually and then meta-analyze them. Otherwise we can encounter a similar problem to snRNAseq, where tens of thousands of observations overwhelm significance. Additionally, spatial relationships are going to be scored differently in joined images of multiple slides versus in isolation. So to handle this lets split them up into a list of gobjs.
gobjs <- splitGiotto(gobj, by = g_col)
# We have to create a spatial network for each slide based on physical distance of cell centroids
gobjs <- lapply(gobjs, function(i) {
i %>%
createSpatialNetwork() %>%
createSpatialGrid(
sdimx_stepsize = 500,
sdimy_stepsize = 500,
minimum_padding = 50
) -> i
})
## Setting spatial network [cell] Delaunay_network
## Setting spatial network [cell] Delaunay_network
## Setting spatial network [cell] Delaunay_network
## Setting spatial network [cell] Delaunay_network
## Setting spatial network [cell] Delaunay_network
## Setting spatial network [cell] Delaunay_network
# example of a spatial networks (small window)
smallfov <- subsetGiottoLocs(gobjs[[1]],
x_max = 500,
x_min = 0,
y_max = 3000,
y_min = 2000
)
spatPlot(smallfov,
show_network = TRUE,
network_color = "blue",
point_size = 1,
cell_color = "Subclass",
show_plot = TRUE
)
Multiple samples per group
If you don’t have multiple images per group, you can rely on the default output for the single image, you will simply not have the extra layer of statistical confidence on between group differences.
Now we can find spatially organized gene expression by examining the binarized expression of cells and their spatial neighbors. The full description is here
# identify spatial genes
spat_feats <- lapply(gobjs, binSpect)
##
## This is the single parameter version of binSpect
##
## 1. matrix binarization complete
##
## 2. spatial enrichment test completed
##
## 3. (optional) average expression of high
## expressing cells calculated
##
## 4. (optional) number of high expressing cells
## calculated
##
## This is the single parameter version of binSpect
##
## 1. matrix binarization complete
##
## 2. spatial enrichment test completed
##
## 3. (optional) average expression of high
## expressing cells calculated
##
## 4. (optional) number of high expressing cells
## calculated
##
## This is the single parameter version of binSpect
##
## 1. matrix binarization complete
##
## 2. spatial enrichment test completed
##
## 3. (optional) average expression of high
## expressing cells calculated
##
## 4. (optional) number of high expressing cells
## calculated
##
## This is the single parameter version of binSpect
##
## 1. matrix binarization complete
##
## 2. spatial enrichment test completed
##
## 3. (optional) average expression of high
## expressing cells calculated
##
## 4. (optional) number of high expressing cells
## calculated
##
## This is the single parameter version of binSpect
##
## 1. matrix binarization complete
##
## 2. spatial enrichment test completed
##
## 3. (optional) average expression of high
## expressing cells calculated
##
## 4. (optional) number of high expressing cells
## calculated
##
## This is the single parameter version of binSpect
##
## 1. matrix binarization complete
##
## 2. spatial enrichment test completed
##
## 3. (optional) average expression of high
## expressing cells calculated
##
## 4. (optional) number of high expressing cells
## calculated
spat_feats <- rbindlist(spat_feats, idcol = g_col)
spat_feats <- spat_feats[!feats %like% "Blank"]
# fwrite(spat_feats, file = "SEAAD_MTG_MERFISH.2024-12-11.spat_feats_raw.txt.gz")
# spat_feats <- fread(
# file.path(data_path, "SEAAD_MTG_MERFISH.2024-12-11.spat_feats_raw.txt.gz")
# )
spat_feats
Key values in this analysis are the score,
av_expr, high_expr:
score: estimate * -log(p.value), the
measure of the “spatialness of the gene”av_expr: the average expression high-expressing
geneshigh_expr: the total number of high expressing
genesSo a simple solution is to simply look at the top ranked spatial
genes, these genes have some manner of spatial grouping where the gene
is most expressed, and is great for exploring tissue heterogeneity. The
first, as a rank order, can be combined using RobustRankAgg
for each of our experimental groups. The second we want to test for a
difference between ad and control groups.
# first, lets calculate the rank aggregate spatial score for each `status`
agg <- merge(
x = aggregateRanks(
lapply(meta[Group == 'ad', `Donor ID`], function(i) {
spat_feats[order(-score)][`Donor ID` == i, feats]
})
),
y = aggregateRanks(
lapply(meta[Group == 'ctrl', `Donor ID`], function(i) {
spat_feats[order(-score)][`Donor ID` == i, feats]
})
),
by = 'Name',
suffixes = c('.ad', '.ctrl')
)
setorder(agg, Score.ad, Score.ctrl)
agg
# plot the top aggregate score for ad
top <- agg[1, "Name"]
p_list <- lapply(names(gobjs), function(i) {
p <- spatFeatPlot2D(gobjs[[i]],
expression_values = "scaled",
feats = top,
point_shape = "border",
point_border_stroke = 0.1,
show_network = FALSE,
network_color = "lightgrey",
point_size = 1,
scale_alpha_with_expression = TRUE,
cow_n_col = 1,
save_plot = FALSE,
return_plot = TRUE
)
p + scale_color_distiller(name = "RdBu", limits = c(-3, 3)) + ggtitle(paste(i, top))
})
wrap_plots(p_list)
# was that gene a de gene?
de[Symbol == top][order(p_val_adj)]
So the top gene is pretty highly expressed in the L2/3 layer, regardless of disease status. Even more apparent when you pop it out in a new viewer window. In fact, it appears most spatial genes have similar agg ranks in either case, which you might expect from a very layered tissue like the cortex.
Now let’s look at the top changing genes across the two conditions.
We can use a common DE package that can handle log transformed values
like limma.
# to test differences in average expression in high cells, use limma's lmFit ---
# make a table of the av_expr for each Donor
exp <- data.frame(
dcast(spat_feats, feats ~ `Donor ID`, value.var = 'av_expr'),
row.names = 'feats'
)
# reorder and set the disease status for limma
exp <- exp[, levels(meta$`Donor ID`)]
design <- model.matrix(~ Group + Sex + `Age at Death`, data = meta)
# sample level ad vs ctrl
spat_lm <- as.data.table(
topTable(eBayes(lmFit(exp, design)), number = nrow(exp), coef = 2),
keep.rownames = "feats"
)
# merge with aggregate spatial scores
spat_lm <- merge(spat_lm, agg, by.x = 'feats', by.y = 'Name')
setorder(spat_lm, adj.P.Val, P.Value)
fwrite(spat_lm, file = "SEAAD_MTG_MERFISH.2024-12-11.spat_feats_agg.txt.gz")
# spat_lm <- fread(
# file.path(data_path, "SEAAD_MTG_MERFISH.2024-12-11.spat_feats_agg.txt.gz")
# )
spat_lm
# plot the top score changes between groups
top <- "TACR1"
p_list <- lapply(names(gobjs), function(i) {
p <- spatFeatPlot2D(gobjs[[i]],
expression_values = "scaled",
feats = top,
point_shape = "border",
point_border_stroke = 0.1,
show_network = FALSE,
network_color = "lightgrey",
point_size = 1,
scale_alpha_with_expression = TRUE,
cow_n_col = 1,
save_plot = FALSE,
return_plot = TRUE
)
p + scale_color_distiller(name = "RdBu", limits = c(-3, 3)) + ggtitle(paste(i, top))
})
wrap_plots(p_list)
# How about this one, DE?
de[Symbol == top][order(p_val_adj, p_val)]
So notably, spatial genes can change in two ways:
Similar to bulkRNAseq data, you can also create co-expression modules of correlated spatial genes so see which have similar patterns, and develop metagenes analogous to eigengenes for each module. In cases like an organized brain though, these are likely to overlap with your existing cluster annotations. Lastly, you can use these spatial genes with a hidden markov random field (HMRF) model to segment cell clusters by spatial domain instead of traditional Leiden clustering.
Strictly speaking, there is a third spatial gene alteration to consider, which is that the spatial shape or size might change in a way that doesn’t alter it’s spatial rank, but that would be a much rarer and more challenging thing to identify without visual inspection…
So, on to the next plane of spatial information. Since we have cell types, we can define fairly easily which cells are next to one another. Are L2/3 IT neurons next to L4 IT neurons? Certainly a fair few are.
But what about the more mobile cell types like glia? Does that change with your condition?
cell_prox <- lapply(gobjs, cellProximityEnrichment,
cluster_column = "Subclass",
adjust_method = "fdr"
)
# saveRDS(cell_prox, file = "SEAAD_MTG_MERFISH.2024-12-11.cell_prox_raw.rds")
# cell_prox <- readRDS(
# file.path(data_path, "SEAAD_MTG_MERFISH.2024-12-11.cell_prox_raw.rds")
# )
# barplot
p_list <- lapply(c(1, length(gobjs)), function(i) {
cellProximityBarplot(gobjs[[i]],
CPscore = cell_prox[[i]]
)
})
wrap_plots(p_list) +
plot_layout(guides = 'collect') +
plot_annotation(tag_levels = list(levels(meta$Group)))
In this case, the same spatial network is used, but you are calculating
the number of cell-cell pairings observed versus expected with
permutations of random re-shuffling of the cell type labels on your
spatial network. A given pairing can either be enriched (more likely
than expected) or depleted (less). This can be shown as a ranked
barplot, a heatmap, or a network diagram.
homotype entries same-cell interactions.Do any of the cell types seem to stick together?
We can simplify the results table to understand where the numbers are coming from.
# lets simplify the results so we can take a closer look
prox_tab <- lapply(cell_prox, function(i) i[[2]])
prox_tab <- rbindlist(prox_tab, idcol = g_col)
prox_tab[order(-PI_value)]
Key values in this analysis are the unified_int,
enrichm, PI_value:
unified_int: Interacting pair, either type
hetro or homoenrichm:
log2((original + 1) / (simulations + 1)) enrichment ratio,
i.e. observed over the expected frequency of cell-cell proximity
interactionsPI_value: ranking score, basically
-log10(p.adj + 1/n_perm) * enrichmSo similar to before we have an effect size (enrichm)
and a score (PI_value), but unlike with spatial genes we do
not have an average expression value to use. Also notice, the
PI_value has a sign, so strongly depleted values would rank
last! In this case an average PI_value for each condition
would be more informative, and then we can test the actual
enrichm across groups for differential effect.
# first, lets calculate the mean PI_value for each group
agg <- merge(
x = prox_tab[
`Donor ID` %in% meta[Group == 'ad', `Donor ID`],
.(avg_PI.ad = mean(PI_value)),
by = unified_int
],
y = prox_tab[
`Donor ID` %in% meta[Group == 'ctrl', `Donor ID`],
.(avg_PI.ctrl = mean(PI_value)),
by = unified_int
]
)
# now lets check if any interactions showed an average difference
agg[, dPI.ad.vs.ctrl := avg_PI.ad - avg_PI.ctrl]
setorder(agg, -dPI.ad.vs.ctrl)
agg
Now let’s look at the top changing genes across the two conditions.
This time we won’t have log transformed expression values, so we will
have to rely on a generalized linear model for logistic regression
(ad vs ctrl)
# make a table of all of the data for each Donor (rows)
exp <- dcast(prox_tab, `Donor ID` ~ unified_int, value.var = 'PI_value')
exp <- merge(meta, exp)
names(exp) <- make.names(names(exp)) # glm has a thing with dashes in names
# make output table
prox_lm <- unique(prox_tab[, .(unified_int = make.names(unified_int))])
# fast run function
prox_lm[,
c("estimate", "se", "z", "p", "nobs") := get_linreg(
dat = exp,
outcome = "Group",
exposure = unified_int,
covs = c("Sex", "Age.at.Death"),
fam = binomial(link = logit)
),
by=.I
]
prox_lm[, padj := p.adjust(p, method = "fdr")]
prox_lm[order(p)]
What happened?
This is a situation referred to as complete separation. There is a covariate or combination of covariates that completely separates the outcome variable. You can try rerunning it without either (or both) the Age and Sex covariates, as dropping each separately fixes some of the overlarge estimates.
In reality, you would not be running three samples per group with to known disease modifying covariates. Luckily we happen to have an ace in the hole. Instead of running logistic regression, we can use pseudo-progression as a linear outcome.
# make output table
prox_lm <- unique(prox_tab[, .(unified_int = make.names(unified_int))])
# Try this again, but with a linear regression
prox_lm[,
c("estimate", "se", "z", "p", "nobs") := get_linreg(
dat = exp,
outcome = "Pseudo_prog",
exposure = unified_int,
covs = c("Sex", "Age.at.Death")
),
by=.I
]
prox_lm[, padj := p.adjust(p, method = "fdr")]
# fix those names
prox_lm[, unified_int := gsub("..", "--", unified_int, fixed = TRUE)]
prox_lm[, unified_int := gsub("ia.PV", "ia-PV", unified_int, fixed = TRUE)]
prox_lm[, unified_int := gsub("L2.3", "L2/3", unified_int, fixed = TRUE)]
prox_lm[, unified_int := gsub("L5.6", "L5/6", unified_int, fixed = TRUE)]
prox_lm[, unified_int := gsub(".", " ", unified_int, fixed = TRUE)]
# merge with aggregate spatial scores
prox_lm <- merge(prox_lm, agg, by = 'unified_int')
setorder(prox_lm, padj, p)
# fwrite(prox_lm, file = "SEAAD_MTG_MERFISH.2024-12-11.cell_prox_agg.txt.gz")
# prox_lm <- fread(
# file.path(data_path, "SEAAD_MTG_MERFISH.2024-12-11.cell_prox_agg.txt.gz")
# )
prox_lm[order(p)]
# lets look at some numbers of each type of cells to see if that is impacting the result
counts <- pDataDT(gobj)[, .N, by = .(Subclass, `Donor ID`)]
dcast(counts, Subclass ~ `Donor ID`)
## Using 'N' as value column. Use 'value.var' to override
# plot it proportionally
ggplot(counts, aes(x = `Donor ID`, y = N, fill = Subclass)) +
geom_bar(stat = 'identity', position = 'fill') +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = as.vector(palette.colors(n = 24, palette = "Alphabet")))
Proximity changes are potentially more relevant to other disease types such as cancer, where the cancer cells themselves may form a body around which other cells may gravitate. We can see that the top scorers are not padj significant, but can still take a look and see why they may be interesting.
First up makes little sense in terms of the types suggested, as L6 and L2/3 should rarely if ever overlap. But looking at the cell counts per sample, these are robust populations. So lets visualize the connections.
# plot the connections between interesting pairs
top <- "L2/3 IT--L6 IT Car3"
p_list <- lapply(names(gobjs), function(i) {
p <- cellProximitySpatPlot2D(gobjs[[i]],
interaction_name = top,
show_network = TRUE,
cluster_column = "Subclass",
cell_color = "Subclass",
network_color = 'black',
cell_color_code = c(`L2/3 IT` = "lightblue", `L6 IT Car3` = "red"),
point_size_select = 1,
point_size_other = 1,
save_plot = FALSE,
return_plot = TRUE
)
p + ggtitle(paste(i, meta[`Donor ID` == i, Group], sep = " - "))
})
# plot all together
wrap_plots(p_list) + plot_layout(guides = 'collect')
# # or separately
# p_list
Given how few cells are actually connecting in our sample relative to how many of each cell type are present, it seems more likely an annotation artifact.
In the case of L5 ET is a low count cell type (which can
produce some inflated variance in estimated occurrence).
One last exploration is to examine the actual connections in the
tissue. Lets pick one of the hetero matches with larger
numbers, Astrocyte--Oligodendrocyte
# plot the connections between interesting pairs
top <- "Astrocyte--Oligodendrocyte"
p_list <- lapply(names(gobjs), function(i) {
p <- cellProximitySpatPlot2D(gobjs[[i]],
interaction_name = top,
show_network = TRUE,
cluster_column = "Subclass",
cell_color = "Subclass",
network_color = 'black',
cell_color_code = c(`Astrocyte` = "lightblue", `Oligodendrocyte` = "red"),
point_size_select = 1,
point_size_other = 1,
save_plot = FALSE,
return_plot = TRUE
)
p + ggtitle(paste(i, meta[`Donor ID` == i, Group], sep = " - "))
})
# plot all together
wrap_plots(p_list) + plot_layout(guides = 'collect')
# # or separately
# p_list
This one is interesting at least in that there are a fair few
interactions, and a hint of something different in the lower layers for
ad. What we would need to be certain is higher number of
samples.
Moving onto our third category, interaction changed features. Rather than specifically labeling dynamic changes in cell location, we can ask are there any genes in a given cell that change in expression when it is in proximity to another type of cell. Do neurons talk differently when microglia are around? We can test this for every cell type and it’s proximity to every other cell type, for every gene, but that becomes…taxing. So for now lets look at the most expressed genes.
# select top 25th highest expressing genes
gene_meta <- fDataDT(gobj)
high_exp <- gene_meta[
mean_expr_det > quantile(mean_expr_det, probs = 0.75),
feat_ID
]
gene_meta[, labs := ifelse(mean_expr_det > quantile(mean_expr_det, probs = 0.75), feat_ID, NA)]
# which look like?
ggplot(gene_meta, aes(x = nr_cells, y = mean_expr_det, label = labs)) +
geom_point() +
geom_text_repel(max.overlaps = 20) +
theme_classic()
## Warning: Removed 135 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
To move more efficiently, we can utilize multiple cores. Even still, at full bore with a subset of genes this takes 15-20 minutes! Cluster computing becomes really essential for these last spatial data types, as they will run for hours/days. I lowered it to 100 permutations just to try it out, but we then load the saved result.
## identify genes that are associated with proximity to other cell types
plan("multisession", workers = 8)
# ICFs <- lapply(gobjs, findInteractionChangedFeats,
# selected_feats = high_exp,
# cluster_column = "Subclass",
# # diff_test = "permutation",
# adjust_method = "fdr",
# nr_permutations = 10,
# do_parallel = TRUE
# )
# saveRDS(ICFs, file = "SEAAD_MTG_MERFISH.2024-12-11.icf_raw.rds")
ICFs <- readRDS(
file.path(data_path, "SEAAD_MTG_MERFISH.2024-12-11.icf_raw.rds")
)
# lets filter and simplify the results so we can take a closer look
ICFscoresFilt <- lapply(ICFs, filterICF,
min_cells = 10,
min_int_cells = 10,
min_cells_expr = 2,
min_int_cells_expr = 2,
min_fdr = 0.05,
min_spat_diff = 0.2,
min_log2_fc = 0.5,
min_zscore = 2
)
ICFscoresFilt <- lapply(ICFscoresFilt, function(i) i[['ICFscores']])
ICFscoresFilt <- rbindlist(ICFscoresFilt, idcol = g_col)
ICFscoresFilt[order(zscores)]
Now it is getting really tricky to interpret. Here are some key output metrics:
sel: average feature expression in the interacting
cells from the target cell typeother: average feature expression in the
NOT-interacting cells from the target cell typelog2fc: log2 fold-change between sel and otherdiff: spatial expression difference between sel and
otherp.adj: adjusted p-valuecell_type: target cell typeint_cell_type: interacting cell typeYou are doing a statistical test in each sample to test whether gene expression is different in interacting target cells versus non interacting target cells. In this case permutations of cell type labels are similar to before, but you are NOT testing your value directly against random permutations.
We are also introducing a new concept, target and
interacting cell types.
Finally, to standardize this, we filter the large number of target-interactor-feature combinations, and then z-scale against all cell types in a given sample to control for sample to sample comparison.
# get the top scoring set from any sample
top <- unique(ICFscoresFilt[, .(feats, cell_type, int_cell_type)])
# get the raw stats for those from EVERY sample
ICFscoresFilt <- lapply(ICFs, filterICF,
min_cells = 10,
min_int_cells = 10,
min_cells_expr = 0,
min_int_cells_expr = 0,
min_fdr = 1.01,
min_spat_diff = 0,
min_log2_fc = 0,
min_zscore = 0
)
# subset of only those in teh top list
all_top_scores <- lapply(ICFscoresFilt, function(i) {
x <- merge(i$ICFscores, top, by = c("feats", "cell_type", "int_cell_type"))
})
all_top_scores <- rbindlist(all_top_scores, idcol = g_col)
Now that we have z-scores and avg log2FC for all the samples lets aggregate that
# first, lets calculate the mean values for each group
all_top_scores[, feat_by_int := paste(feats, cell_type, int_cell_type, sep = "_")]
agg <- merge(
x = all_top_scores[
`Donor ID` %in% meta[Group == 'ad', `Donor ID`],
.(avg_L2FC.ad = mean(log2fc), avg_zscore.ad = mean(zscores)),
by = "feat_by_int"
],
y = all_top_scores[
`Donor ID` %in% meta[Group == 'ctrl', `Donor ID`],
.(avg_L2FC.ctrl = mean(log2fc), avg_zscore.ctrl = mean(zscores)),
by = "feat_by_int"
]
)
# now lets check if any interactions showed an average difference
agg[, dL2FC.ad.vs.ctrl := avg_L2FC.ad - avg_L2FC.ctrl]
agg[, dzscore.ad.vs.ctrl := avg_zscore.ad - avg_zscore.ctrl]
setorder(agg, -dL2FC.ad.vs.ctrl)
agg
And finally, lets test the significant of the change in log2fc of expression with interactor versus without.
# make a table of all of the data for each Donor (rows)
exp <- dcast(all_top_scores, feat_by_int ~ `Donor ID`, value.var = 'log2fc')
exp <- exp[complete.cases(exp)]
exp <- merge(meta, transpose(exp, keep.names = "Donor ID", make.names = "feat_by_int"))
names(exp) <- make.names(names(exp)) # glm has a thing with dashes in names
# make output table
icf_lm <- unique(all_top_scores[, .(feat_by_int = make.names(feat_by_int))])
icf_lm <- icf_lm[feat_by_int %in% names(exp)]
# Try this again, but with a linear regression
icf_lm[,
c("estimate", "se", "z", "p", "nobs") := get_linreg(
dat = exp,
outcome = "Pseudo_prog",
exposure = feat_by_int,
covs = c("Sex", "Age.at.Death")
),
by=.I
]
icf_lm[, padj := p.adjust(p, method = "fdr")]
# fix those names
icf_lm[, feat_by_int := gsub("..", "--", feat_by_int, fixed = TRUE)]
icf_lm[, feat_by_int := gsub("ia.PV", "ia-PV", feat_by_int, fixed = TRUE)]
icf_lm[, feat_by_int := gsub("L2.3", "L2/3", feat_by_int, fixed = TRUE)]
icf_lm[, feat_by_int := gsub("L5.6", "L5/6", feat_by_int, fixed = TRUE)]
icf_lm[, feat_by_int := gsub(".", " ", feat_by_int, fixed = TRUE)]
# merge with aggregate spatial scores
icf_lm <- merge(icf_lm, agg, by = 'feat_by_int')
setorder(icf_lm, padj, p)
# fwrite(icf_lm, file = "SEAAD_MTG_MERFISH.2024-12-11.icf_agg.txt.gz")
# icf_lm <- fread(
# file.path(data_path, "SEAAD_MTG_MERFISH.2024-12-11.icf_agg.txt.gz")
# )
icf_lm[order(p)]
Let’s plot this in one sample to visualize what we are looking at.
# grab the unfiltered top feat_by_int set
filt_top_scores <- lapply(ICFscoresFilt, function(i) i[['ICFscores']])
filt_top_scores <- rbindlist(filt_top_scores, idcol = g_col)
filt_top_scores[, lab := ifelse(p.adj < 0.05, "*", "")]
filt_top_scores$`Donor ID` <- factor(filt_top_scores$`Donor ID`, levels = meta$`Donor ID`)
# L2/3 IT as the target cell type
myPalette <- colorRampPalette(rev(RColorBrewer::brewer.pal(11, "RdBu")))
p1 <- ggplot(
data = filt_top_scores[
cell_type == "L2/3 IT"
& int_cell_type == "Pvalb"
& feats %in% c("RORB", "MOG", "RGS6")
],
mapping = aes(x = `Donor ID`, y = feats, fill = log2fc, label = lab)) +
geom_tile() +
geom_text(size = 10) +
scale_fill_gradientn(colors = myPalette(100)) +
labs(x = "Donor ID", title = "L2/3 IT Genes interacting with Pvalb") +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
plot.title = element_text(hjust = 0.5, face = 'bold')
)
p1
# L2/3 IT as the interacting cell type
p2 <- ggplot(
data = filt_top_scores[
cell_type %in% c("VLMC", "OPC", "Lamp5")
& int_cell_type == "L2/3 IT"
& feats %in% c("CLSTN2", "CUX2", "CBLN2")
],
mapping = aes(x = `Donor ID`, y = feats, fill = log2fc, label = lab)) +
geom_tile() +
geom_text(size = 10) +
facet_grid(.~cell_type) +
scale_fill_gradientn(colors = myPalette(100)) +
labs(x = "Donor ID", title = "Cell types whose Genes interact with L2/3 IT") +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
plot.title = element_text(hjust = 0.5, face = 'bold')
)
p2
# How about this one, DE?
de[Symbol == "RORB"][order(p_val_adj, p_val)]
That presents a pretty good preliminary finding that RORB in L2/3 IT cells is increasing in ad, specifically in the presence of Parvalbumin GABAergic neurons. It doesn’t seem to be coming from the Pvalb cells themselves, as they significantly downregulate RORB in a non-spatial manner. Be careful assigning causality here, a lot more molucular work would be needed to verify.
Important to remember, we looked at only the top markers, from a MERFISH panel specifically of cell type markers. Other spatial transcriptomic panels.
Before we wrap up, let’s clean up the environment to help Cavatica shut down
We won’t have time today to cover the fourth type of spatial data: Cell-cell ligand receptor interactions
These logically take the next step, although it gets even more complicated to consider the co-occurrence of a target cell expressing an ICF (ligand), and an interacting cell expressing it’s own ICF (receptor). These analyses are explored in detail in both single-cell and spatial analyses, allowing to address a key concern…
Takeaway
interacting cells in spatial data can “express” genes from the other cell type?
No, they likely don’t. What is more likely is that cell segmentation has drawn a cell boundary over the overlap of two cell types. This is a key reason to always validate spatial findings elsewhere. Single-cell/nuc provides a pretty robust check, as fluid dynamics have separated those cells (assuming you filtered your doublets).
This can be done in any number of platforms, there is a whole paper that is
already out of date on the subject. This field is moving fast!
In my case, I am presenting at AAIC our work with liana that aggregates multiple software and can also incorporate 4D tensor differential analysis to find ligand-receptor pairs and source-target cell types that are differentiallt enriched in single-cell/nucleus data
To summarize, for tau pathology mice, for Glutamatergic neurons the have ligand receptor enrichments for the disease, that are in some cases reversed by treatment with a Longo Lab therapeutic, LM11A-31.
Notably, both L4 IT CTX and L2/3 IT CTX neurons independently differentially express Apoe. Making it a really compelling target tosee if this relationship is enriched when the two cell types come closer to one another.
Indeed, using the CosMx 1k Mouse Neuroscience Panel, we can look at a
fair number of relevant molecules.
Still missing a bunch though, including key ones. Hopefully as 5k, 6k, and whole transcriptomic panels emerge, they will make this even more compelling. However, more genes does mean more computationally intensive.
As new tools come online, you can further explore the relevant consequences of these spatial relationships. For instance, Spacia uses a Bayesian multiple instance learning to define a target cell’s receptor engagement from multiple surrounding cells to determing a greater level of context to cell-cell communication.
Tools like Spacia can identify genes in Astrocytes, that are correlated to Apoe expression in L4 IT CTX neurons
And future frontiers are already here on multimodal analyses (I hear
there is a session on that later in the week…)
Ultimately, I hope this gave you a start on concepts in this space and how to approach them with some existing tools. New tools will emerge, but if you can understand what information these types of findings can tell you:
Single-cell/nucleus transcriptomics
Spatial transcriptomics
Then you will be able to use that boat-load of money you just spent!
Okay, so what if you want to try this outside of the course? Since that data is publicly available. All you need is the script to build the data. Happy coding…